1 Introduction

Overview and Motivation


The objective of this project is to analyse crime in the city of San Francisco, a city known for its cultural richness as well as its economic strength. Much of the current media landscape seems to be filled with stories of violence in inner cities, so we will start by looking at crime trends over time, with data going back to 2003. Our analysis will focus on the interactions of crime with three different phenomena: socio-demographic characteristics, transportation and the COVID-19 crisis. Our intuition is that interconnectedness and density play a significant role in all of these, and we will see where the empirical data leads us. Having collected data on the socio-economic characteristics of the city’s 41 neighborhoods (age, income, educational attainment etc.), we will attempt to predict the neighborhoods with the highest crime rates based on these variables. We will also look at possible correlations between public transportation density and crime. Finally, we will examine whether the current pandemic and the associated lockdowns had any effect on crime rates. One of us took a criminology course during his bachelor’s degree and was deeply interested in finding explanations for crime. Seeking to understand the why of crime could be useful for other studies, and there is still a lot to be discovered.

2 Data

Raw Data Set


2.1 Crime Data

First, we will load two data sets that record the incident reports that have been filed to the police.


2.1.1 2018 To Present


Source of the data set: [https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783]

This dataset covers the period from the 1st January 2018 to the day in 2020 on which we downloaded the data, and contains about 408K observations.


  • We will focus on these variables: Incident Date, Incident Time, Incident Day of Week, Incident Category, Resolution, Analysis Neighborhood, point.

    • Incident Datetime The date and time when the incident occurred
    • Incident Date The date the incident occurred
    • Incident Time The time the incident occurred
    • Incident Year The year the incident occurred, provided as a convenience for filtering
    • Incident Day of Week The day of week the incident occurred
    • Report Datetime Distinct from Incident Datetime, Report Datetime is when the report was filed.
    • Incident Category A category mapped on to the Incident Code used in statistics and reporting.
    • Incident Subcategory A subcategory mapped to the Incident Code that is used for statistics and reporting.
    • Incident Description The description of the incident that corresponds with the Incident Code.
    • Resolution The resolution of the incident at the time of the report.
    • Analysis Neighborhood This field is used to identify the neighborhood where each incident occurs. Neighborhoods and boundaries are defined by the Department of Public Health and the Mayor’s Office of Housing and Community Development. Please reference the link below for additional info: [https://data.sfgov.org/d/p5b7-5n3h].
    • Latitude The latitude coordinate
    • Longitude The longitude coordinate
    • point The point geometry used for mapping features in the open data portal platform. Latitude and Longitude are provided separately as well as a convenience.


In order to join the two data sets, which cover different periods, and to focus on the variables needed to answer our research questions, we have removed a few columns.
The remaining variables will let us look at the evolution of crime over time and analyze crime by neighborhood, and thus examine whether the variation in crime across neighborhoods can be explained by socio-economic variables.

2.1.2 2003 To 2018

Source of the data set: [https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry]

This dataset covers the period from the 1st January 2003 to the 15th May 2018 and contains about 2.16M observations.

As for the 2.1.1 data set, we will focus on these variables: Incident Category, DayOfWeek, Date, Time, Resolution, Y, X, location, Analysis Neighborhoods 2 2.

  • Incident Category A category mapped on to the Incident Code used in statistics and reporting.
  • Descript The description of the incident that corresponds with the Incident Code.
  • DayOfWeek The day of week the incident occurred
  • Date The date the incident occurred
  • Time The time the incident occurred
  • Resolution The resolution of the incident at the time of the report.
  • Y The latitude coordinate
  • X The longitude coordinate
  • location The point geometry used for mapping features in the open data portal platform. Latitude and Longitude are provided separately as well as a convenience.
  • Analysis Neighborhoods 2 2 This field is used to identify the neighborhood where each incident occurs. Neighborhoods and boundaries are defined by the Department of Public Health and the Mayor’s Office of Housing and Community Development. Please reference the link below for additional info: [https://data.sfgov.org/d/p5b7-5n3h].


For the Analysis Neighborhoods 2 2, we had to harmonize the numbers found in this column with the neighborhood names and make sure that we had the correct ones.
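The harmonization and stacking of the two incident tables can be sketched as follows. This is a minimal base R illustration, not the code used in the project: the toy rows, the simplified column names of the combined table and the number-to-name lookup `nhood_lookup` are all assumptions made for the example.

```r
# Toy stand-ins for the two incident tables; values are invented for illustration.
recent <- data.frame(
  incident_date = as.Date(c("2020-08-15", "2020-08-16")),
  category      = c("Assault", "Malicious Mischief"),
  neighborhood  = c("Excelsior", "Mission"),
  stringsAsFactors = FALSE
)
historical <- data.frame(
  Date     = as.Date(c("2016-03-01", "2016-03-02")),
  Category = c("Larceny/Theft", "Assault"),
  `Analysis Neighborhoods 2 2` = c(1L, 2L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical number-to-name lookup; the real mapping has to be verified
# against the Analysis Neighborhoods data set.
nhood_lookup <- c("1" = "Bayview Hunters Point", "2" = "Bernal Heights")

# Bring the historical table to the same column names as the recent one.
historical_h <- data.frame(
  incident_date = historical$Date,
  category      = historical$Category,
  neighborhood  = unname(
    nhood_lookup[as.character(historical$`Analysis Neighborhoods 2 2`)]
  ),
  stringsAsFactors = FALSE
)

# Stack both periods into one table covering the whole time span.
crime_all <- rbind(historical_h, recent)
print(crime_all)
```

Translating the numeric codes through an explicit lookup keeps the mapping in one place, which makes it easier to check each code against the official neighborhood names.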


2.2 Socio-Economic Data


2.2.1 2012 To 2016

Then, we will load four data sets that record the evolution of some socio-economic variables for each neighborhood of San Francisco. As these reports - built with census data - were only exported in pdf format, we decided to create our own data set in csv in order to use the information from these reports for different periods.

Source of the data set: [https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2012-2016_ACS_Profile_Neighborhoods_Final.pdf]

2.2.2 2010 To 2014


Source of the data set: [https://default.sfplanning.org/publications_reports/SF_NGBD_SocioEconomic_Profiles/2011-2015_ACS_Profile_Neighborhoods_Final.pdf]

2.2.4 2005 To 2009

Source of the data set: [https://sf-planning.org/sites/default/files/FileCenter/Documents/8501-SFProfilesByNeighborhoodForWeb.pdf]


  • Columns of the data sets:

    • NEIGHBOORHOD Each neighborhood of San Francisco. Segmented using the geospatial data in this link: [https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h]
    • Time Horizon This column highlights the time horizon where the data have been collected.
    • Total Population This column highlights the total population for each neighbourhood.
    • Households This column highlights the total households number for each neighbourhood.
    • Family Households This column highlights the proportion of family-households within a neighborhood.
    • Non-Family Households This column highlights the proportion of non-family-households within a neighbourhood
    • Average Household Size This column highlights the average household size for each neighbourhood.
    • Asian proportion of Asian people within the neighborhood
    • Black/African American proportion of Black/African American people within the neighborhood
    • White proportion of White people within the neighborhood
    • Native American Indian proportion of Native American Indian people within the neighborhood
    • Native Hawaiian/Pacific Islander proportion of Native Hawaiian/Pacific Islander people within the neighborhood
    • Other/Two or More Races proportion of Other/Two or More Races people within the neighborhood
    • Latino (of Any Race) proportion of Latino (of Any Race) people within the neighborhood
    • 0-4 years proportion of people with age = [0-4] within the neighborhood
    • 5-17 years proportion of people with age = [5-17] within the neighborhood
    • 18-34 years proportion of people with age = [18-34] within the neighborhood
    • 35-59 years proportion of people with age = [35-59] within the neighborhood
    • 60 and older proportion of people with age = [60-older] within the neighborhood
    • Median Age This column highlights the median age within a neighbourhood
    • High School or Less proportion of people whose highest level of education is high school or less within the neighborhood
    • Some College/Associate Degree proportion of people whose highest level of education is some college or an associate degree within the neighborhood
    • College Degree proportion of people whose highest level of education is a college degree within the neighborhood
    • Graduate/Professional Degree proportion of people with a graduate or professional degree within the neighborhood
    • Foreign Born proportion of foreign-born people within the neighborhood
    • English Only proportion of people who speak only English within the neighborhood
    • Spanish Only proportion of people who speak only Spanish within the neighborhood
    • Asian/Pacific Islander proportion of people who speak only an Asian or Pacific Islander language within the neighborhood
    • Other European Languages Only proportion of people who speak only another European language within the neighborhood
    • Other Languages proportion of people who speak only another language within the neighborhood
    • Units of Housing this column highlights the number of housing units for each neighbourhood.
    • Median Year Structure Build this column highlights the median year in which housing structures were built for each neighbourhood.
    • Median Rent this column highlights the median housing rent for each neighbourhood.
    • Median Home Value this column highlights the median home value for each neighbourhood.
    • Median Household Income this column highlights the median household income for each neighbourhood.
    • Percent in Poverty this column highlights the % in poverty for each neighbourhood.
    • Unemployment Rate this column highlights the % of unemployment for each neighbourhood.
    • Population Density per Acre this column highlights the number of people per Acre for each neighborhood.


With all these variables, we can see each neighborhood in terms of the number and density of the population, household composition, race, age distribution, education, languages spoken, and the economy of the neighborhood. This will be very helpful for answering some of our research questions.


2.2.5 COVID


Source of the data set: [https://catalog-next.data.gov/dataset/covid-19-cases-and-deaths-summarized-by-geography]

  • Columns of the data set:

    • count number of cases
    • acs_population Population
    • multypoligon geospatial information

Source of the data set: [https://data.sfgov.org/COVID-19/COVID-19-Cases-Summarized-by-Date-Transmission-and/tvq9-ec9w]

  • We will use this data set to compare crime and COVID-19 cases.


2.3 Transportation Data

2.3.1 Public transportation

Next, we will load a dataset that lists the transit stops in the SFMTA system. We will use this data to analyze the transit situation by neighborhood.

Source of the data set: [https://catalog.data.gov/dataset/muni-stops].


  • Here are the columns we are going to focus on:
    • STOPNAME The names of all the transit stops
    • shape The point geometry used for mapping features
    • Analysis Neighborhoods This column indicates the neighborhood of San Francisco in which the transit stop is located.


The other columns of this data set are not relevant for our research project. We will only use this data set to measure the density of the public transport network in each neighbourhood and to see whether the neighbourhoods with the most stops coincide with the neighbourhoods with the most crime.


2.4 Geospatial Data

2.4.1 Neighborhoods

#> Reading layer `Analysis Neighborhoods' from data source `/Users/ROUGE/Desktop/STUDY HARD/Master/Data_Science/dsfba_project/data/Analysis Neighborhoods.geojson' using driver `GeoJSON'
#> Simple feature collection with 41 features and 1 field
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -123 ymin: 37.7 xmax: -122 ymax: 37.8
#> geographic CRS: WGS 84
#> Simple feature collection with 41 features and 1 field
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -123 ymin: 37.7 xmax: -122 ymax: 37.8
#> geographic CRS: WGS 84
#> First 10 features:
#>                             nhood                       geometry
#> 1           Bayview Hunters Point MULTIPOLYGON (((-122 37.8, ...
#> 2                  Bernal Heights MULTIPOLYGON (((-122 37.7, ...
#> 3             Castro/Upper Market MULTIPOLYGON (((-122 37.8, ...
#> 4                       Chinatown MULTIPOLYGON (((-122 37.8, ...
#> 5                       Excelsior MULTIPOLYGON (((-122 37.7, ...
#> 6  Financial District/South Beach MULTIPOLYGON (((-122 37.8, ...
#> 7                       Glen Park MULTIPOLYGON (((-122 37.7, ...
#> 8                Golden Gate Park MULTIPOLYGON (((-122 37.8, ...
#> 9                  Haight Ashbury MULTIPOLYGON (((-122 37.8, ...
#> 10                   Hayes Valley MULTIPOLYGON (((-122 37.8, ...
#> Coordinate Reference System:
#>   User input: WGS 84 
#>   wkt:
#> GEOGCRS["WGS 84",
#>     DATUM["World Geodetic System 1984",
#>         ELLIPSOID["WGS 84",6378137,298.257223563,
#>             LENGTHUNIT["metre",1]]],
#>     PRIMEM["Greenwich",0,
#>         ANGLEUNIT["degree",0.0174532925199433]],
#>     CS[ellipsoidal,2],
#>         AXIS["geodetic latitude (Lat)",north,
#>             ORDER[1],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>         AXIS["geodetic longitude (Lon)",east,
#>             ORDER[2],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>     ID["EPSG",4326]]

Source of the data set: [https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h]

This data set contains multi-polygons that are supposed to represent the segmentation of neighbourhoods corresponding to the Analysis Neighborhood variable. We will use this data set extensively for our geospatial analyses.



2.5 Data Set Cleaning

2.5.1 2003-2020 Crime


Here is our cleaned data set, which takes into account all crimes from 2003 to 2020. We show the first 6 observations. We will use it to look at crime trends.


Incident Date  Incident Time  Incident Day of Week  Incident Category   Incident Description                              Resolution  lon             lat              Analysis Neighborhood
2020-08-15     12:43:00       Saturday              Assault             Battery                                           OPEN        37.71603881888  -122.4402551358  Excelsior
2018-01-18     19:00:00       Thursday              Lost Property       Lost Property                                     OPEN        NA              NA               NA
2020-08-16     03:13:00       Sunday                Assault             Firearm, Discharging in Grossly Negligent Manner  OPEN        37.75482657771  -122.3977287339  Potrero Hill
2020-08-16     03:38:00       Sunday                Malicious Mischief  Malicious Mischief, Breaking Windows              OPEN        37.76653957530  -122.4220438145  Mission
2020-08-15     09:40:00       Saturday              Larceny/Theft       Theft, From Locked Vehicle, >$950                 OPEN        NA              NA               NA
2020-08-16     13:40:00       Sunday                Non-Criminal        Mental Health Detention                           OPEN        37.78404443716  -122.4037117546  Financial District

As you can see here, there are a few NA values for the latitude and longitude variables. For the purpose of creating a time series graph, this is not a problem, and thus we can keep them. However, for geospatial visualization, we are going to create a table where these NA values are removed.
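Keeping the full table for time series while building a coordinate-complete copy for maps can be sketched like this. It is an illustrative base R snippet on a toy table; the object names `crime` and `crime_geo` are assumptions, not names taken from the project.

```r
# Toy incident table: two of the three rows have no coordinates.
crime <- data.frame(
  category = c("Assault", "Lost Property", "Larceny/Theft"),
  lon = c(-122.4402551358, NA, NA),
  lat = c(37.71603881888, NA, NA),
  stringsAsFactors = FALSE
)

# Keep every row for time series work; build a second table for mapping
# that only contains rows where both coordinates are present.
crime_geo <- crime[complete.cases(crime[, c("lon", "lat")]), ]

print(nrow(crime))      # rows available for the time series
print(nrow(crime_geo))  # rows available for maps
```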



2.5.2 Transportation


Here is our cleaned data set that counts the number of public transportation stops in San Francisco. We took the time to replace the numbers associated with each neighborhood with the corresponding names. We will use this data set to see if there is a correlation between crime and public transportation density.


Analysis Neighborhoods   number of stops
Sunset/Parkside          263
Bayview Hunters Point    256
West of Twin Peaks       236
Financial District       168
Mission                  149
Castro/UpperMarket       134
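A table of this shape can be obtained by counting stops per neighborhood. The sketch below uses base R on invented stops; the stop names and the `neighborhood` column are placeholders, and only the output column names follow the table above.

```r
# Toy list of transit stops with the neighborhood each one falls in.
stops <- data.frame(
  STOPNAME     = c("Stop A", "Stop B", "Stop C", "Stop D"),
  neighborhood = c("Mission", "Mission", "Excelsior", "Mission"),
  stringsAsFactors = FALSE
)

# Count stops per neighborhood and sort from most to fewest stops.
stop_counts <- as.data.frame(table(stops$neighborhood), stringsAsFactors = FALSE)
names(stop_counts) <- c("Analysis Neighborhoods", "number of stops")
stop_counts <- stop_counts[order(-stop_counts$`number of stops`), ]

print(stop_counts)
```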


2.5.3 Socio-Economic & Crime Data Combined


This data set highlights the number of crimes by neighborhood for the years 2014 to 2016. We have combined the crime information with our socio-economic data. This dataset will be useful to build a linear regression as well as for the geospatial analysis.
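Combining crime counts with socio-economic variables amounts to a join on the neighborhood name. The following is a hedged base R sketch with invented numbers; the column names `neighborhood`, `total_crimes` and `median_income` are placeholders rather than the project's actual names.

```r
# Toy crime counts and socio-economic table; all values are invented.
crime_counts <- data.frame(
  neighborhood = c("Mission", "Excelsior"),
  total_crimes = c(120, 45),
  stringsAsFactors = FALSE
)
socio <- data.frame(
  neighborhood  = c("Excelsior", "Mission", "Chinatown"),
  median_income = c(70000, 90000, 25000),
  stringsAsFactors = FALSE
)

# Join on the neighborhood key so each count lands on the right row,
# whatever the row order of the two tables. all.x = TRUE keeps
# neighborhoods that have no crime count (they get NA).
combined <- merge(socio, crime_counts, by = "neighborhood", all.x = TRUE)

print(combined)
```

Joining by key rather than binding columns side by side avoids pairing a value with the wrong neighborhood when the two tables are sorted differently.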



This data set highlights the number of COVID cases by neighborhood. This dataset will be useful to answer our second research question.



This data set highlights the number of COVID cases in San Francisco by date. This dataset will be useful to answer our second research question.


date        Case Count
2020-03-10  6
2020-03-11  9
2020-03-12  6
2020-03-13  16
2020-03-14  10
2020-03-15  11



Now that our data has been cleaned up, we can move on to the analytical parts of our project.


3 Exploratory data analysis

3.1 Crime EDA


For the first part of the exploratory data analysis, we are going to look at the evolution of crime over time. Can we see patterns that recur at certain times of the day or at certain times of the year?


3.1.1 Crime by Date/Time

3.1.1.1 Crime by Hour

hour  Number of Crime
18    159418
17    153601
12    152153
19    143477
16    142237
15    136090

This table shows the number of crimes falling within 1-hour intervals. Recorded crime peaks in the [3pm-8pm] time interval, at noon and at midnight. The 3 hours with the most recorded crimes are 6pm, 5pm and 12pm, which correspond to moments of the day when workers are on a break or have finished working.
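The hourly counts can be derived by extracting the hour from the incident time and tabulating it. The snippet below is a base R sketch that uses only the six incident times shown in the cleaned data preview, so its counts are illustrative and do not match the full table.

```r
# Incident times taken from the six-row preview of the cleaned data set.
times <- c("12:43:00", "19:00:00", "03:13:00", "03:38:00", "09:40:00", "13:40:00")

# The hour is the first two characters of an "HH:MM:SS" string.
hours <- as.integer(substr(times, 1, 2))

# Count incidents per hour and sort from most to fewest.
by_hour <- as.data.frame(table(hour = hours), stringsAsFactors = FALSE)
names(by_hour)[2] <- "Number of Crime"
by_hour <- by_hour[order(-by_hour$`Number of Crime`), ]

print(by_hour)
```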


We can also show this same data with a barplot.



We can see that crime starts increasing from 8:00 a.m. There is a peak at 12 o’clock, and then another one during the [3pm-8pm] interval. Interestingly, there is less crime at night, even though one might expect more of it because of the dark and nightlife activities.


We also wanted to see if the number of crimes was different on different days of the week.




For the weekdays (Monday, Tuesday, Wednesday, Thursday), we see the same trend: a gradual increase until 6pm, with a sudden peak at noon, and a gradual decrease after 6pm. For Friday, Saturday and Sunday, on the other hand, the increase lasts until later in the evening, presumably because of the more active nightlife.



3.1.1.2 Crime by Day

We are now going to look at the distribution of crimes according to the days of the week. One might think that crimes increase on Fridays and Saturdays because that’s when people go out the most. But is this really the case?



Fridays do have more crime than any other day of the week. However, the differences between days are small.

3.1.1.3 Crime by Month (2016 Focus)

We will now look at the distribution of the number of crimes per month, to see if there was an increase or decrease in some months. As we will later focus our analysis on 2016, we will analyze the variation in the number of crimes per month for this year.


We don’t see any particular pattern. There is some variation but it seems to be mostly noise. One might have expected an increase in crime in months with warm weather, because people go out more and are more active. But the variation between the months is small and inconclusive.



If we look at the number of crimes by day and by month, we still don’t see any clear pattern and the variation seems to be constant.


3.1.1.4 Crime by Year [2003-2020]


We will look at the evolution of the number of crimes per month and draw a time graph over the period [2003-2020].



There is minor variation (with greater or lesser intensity) between months over time, so we have added a smoothing line to this time series. The smoothed line shows a roughly sinusoidal pattern: a decrease around the crisis years of 2008, followed by an increase until a maximum in 2016, then a decline. Note the sharp fall in 2020: we’ll come back to this later.
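A monthly series with a smoothing line can be built along these lines. This base R sketch runs on simulated incident dates and uses `loess()` as the smoother; the smoother actually used for the figure is not stated in the text, so that choice, like the simulated data, is an assumption.

```r
set.seed(1)

# Simulated incident dates spread over three years (illustration only).
dates <- as.Date("2003-01-01") + sample(0:(365 * 3), 5000, replace = TRUE)

# Count incidents per calendar month.
monthly <- as.data.frame(table(month = format(dates, "%Y-%m")),
                         stringsAsFactors = FALSE)
monthly$month_start <- as.Date(paste0(monthly$month, "-01"))
monthly$t <- as.numeric(monthly$month_start)

# Fit a local regression through the monthly counts; predict() without
# new data returns the fitted (smoothed) value for each month.
fit <- loess(Freq ~ t, data = monthly)
monthly$smoothed <- predict(fit)

print(head(monthly[, c("month", "Freq", "smoothed")]))
```

Drawing `Freq` and `smoothed` against `month_start` then gives the raw monthly line and the smoothed trend on one graph.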


3.1.2 Crime by Category


We will now turn our attention to the categories of crime. What type of crime is San Francisco most affected by?



The trends are clear! The vast majority of crime in San Francisco is theft. The Theft bar of the barplot clearly dominates the others. If we also take into account the Motor Vehicle Theft column, which can be considered theft, we can conclude that property crimes constitute the largest percentage of crimes in San Francisco.


We also wanted to see whether the hourly pattern in the number of crimes differs by crime category.



This visualization does not show clear trends for all categories, but we can see that, for Larceny/Theft, there is a peak at noon and between 5 and 9 pm.



We then looked at whether the number of crimes differs by day of the week, given the crime category.



With these representations of crimes by day and by category, we see that most categories of crime are concentrated on Fridays and Saturdays. Assaults differ slightly: we see more of them on Saturdays and Sundays. All in all, crime is highest around the weekend.

3.1.3 Crime by Neighborhood


Now we will do some geographic visualizations that will allow us to better understand where the places with the most crime in San Francisco are.


The neighbourhoods with the most crime are Tenderloin, South of Market and Mission, along with Financial District and Bayview Hunters Point.




We built an interactive map of San Francisco highlighting the transportation infrastructure. You can explore the map by zooming in on the city’s neighborhoods. By segmenting the city by neighborhood, we have highlighted the neighborhoods where there is the most crime, with darker shades of blue indicating more crime. We can see that the neighborhoods with the most crime are those with the most black (large and highly active) roads. We will take a deeper look into the relationship between the organization of public transportation and crime later. The three dark blue neighbourhoods are the ones previously mentioned: Tenderloin, South of Market, Mission. You can also see orange bubbles that group together the crimes that took place in San Francisco in 2016. This is another way of displaying the data that we show with the blue gradients, but with a more precise geospatial display.



3.1.4 Crime Resolution

While doing some research on the crime situation in San Francisco, we came across an article ([https://www.sfchronicle.com/bayarea/philmatier/article/SF-ranks-high-in-property-crime-while-it-ranks-14439369.php]) that explained that criminals often are not arrested. We wanted to see whether our data supports the same conclusion.


We see very clearly that the majority of crimes have no follow-up, with less than a quarter leading to arrests. We want to analyze this further: are certain categories more affected by the lack of follow-up after reported crimes?


There is no follow-up for the large majority of thefts (ordinary thefts as well as vehicle thefts), the most common crime, and the majority of thieves don’t get caught. It would be interesting for further research to see if clearance rates for theft are similarly low in other American cities.
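The share of incidents per resolution, overall and within each category, can be computed with proportion tables. The base R sketch below uses invented rows; "OPEN" appears in the data preview, while "ARREST" is a placeholder label for any resolution that involves an arrest.

```r
# Toy incidents; categories and resolutions are invented for illustration.
crime <- data.frame(
  category   = c("Larceny/Theft", "Larceny/Theft", "Assault", "Assault"),
  resolution = c("OPEN", "OPEN", "ARREST", "OPEN"),
  stringsAsFactors = FALSE
)

# Overall share of each resolution.
overall <- prop.table(table(crime$resolution))

# Share of each resolution within every crime category (rows sum to 1).
by_category <- prop.table(table(crime$category, crime$resolution), margin = 1)

print(overall)
print(by_category)
```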

3.2 Transport Across Neighborhood


We start from the premise that public transport is a factor that makes certain activities easier for criminals. It’s easier to steal on public transport because a lot of people are packed in tight spaces. It’s also easier to steal in neighborhoods where there is public transportation because you can get away very quickly. We wanted to see if there was any connection between the number of transport stops and the number of crimes.

Violent crimes  number of stops  Analysis Neighborhood
1548            263              Bayview Hunters Point
460             256              Bernal Heights
513             236              Castro/Upper Market
291             168              Chinatown
471             149              Excelsior
1134            134              Financial District/South Beach

With only this table, it’s hard to conclude anything. It would still seem that some neighbourhoods with a lot of crime are not at the top of the rankings when it comes to the number of public transport stops.

3.3 COVID data


We first look at city level COVID-19 cases over time.


The zig-zagging nature of this time series probably reflects lower levels of data collection on weekends. We can still clearly observe relative highs in April, July and November for COVID-19 cases in San Francisco.

Then, we look at neighborhood-level cumulative COVID-19 cases since the beginning of the pandemic, relative to those neighborhoods’ populations.


Even after controlling for the population and displaying the rate of infections per neighborhood, we can see that the virus has affected San Francisco unequally, with a difference of more than a factor of five between the most and least affected neighborhoods.
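Controlling for population means dividing case counts by the number of residents. The base R sketch below uses the `count` and `acs_population` columns described in the data section, with invented neighborhoods and values; the per-10,000 scaling is a choice made for the example.

```r
# Toy neighborhood table; names and values are invented.
covid <- data.frame(
  nhood          = c("Neighborhood A", "Neighborhood B"),
  count          = c(50, 300),
  acs_population = c(10000, 12000),
  stringsAsFactors = FALSE
)

# Cumulative cases per 10,000 residents.
covid$rate_per_10k <- covid$count / covid$acs_population * 10000

# Ratio between the most and the least affected neighborhood.
rate_ratio <- max(covid$rate_per_10k) / min(covid$rate_per_10k)

print(covid)
print(rate_ratio)
```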

3.4 Socio-Demographic data


Since we have 41 neighborhoods, we have decided to limit our EDA for the socio-economic data to the 6 biggest neighborhoods in terms of population.

3.4.1 Household Family Structure



We can see here that while in most of the neighborhoods, family households are the majority, there are exceptions to this. This may be an indication of whether the neighborhood is a relatively lively area filled with students, young professionals and service workers or a quieter area where people with families live.

3.4.2 Race/Ethnicity



San Francisco is a very ethnically diverse city, and as in much of California, Asians constitute a much higher percentage of the population than in the rest of the United States. We can still see here that certain populations are concentrated in certain areas. For instance, among the six neighborhoods, only Bayview Hunters Point has a significant black population.

3.4.3 Age Structure



Age structure can also tell us things about the nature of a neighborhood. Mission, which was the only one where non-family households outnumber ones with families, unsurprisingly has the highest percentage of people in prime working age (18 to 59 years old). Among the other ones we can see that some (Bayview Hunters Point in particular), have a lot of children and teenagers, while others have many people who are at retirement age or close to it.

3.4.4 Educational Attainment



College-educated people are often concentrated in big cities such as San Francisco, so it is interesting to see the areas where there are relatively few of them. Two neighborhoods stand out here. Based on the previous age structure data, we can infer that in one of these (Bayview Hunters Point), this might partly reflect that a higher part of the population is too young to have gone to/finished college, while in the other (Excelsior), this indicates a relative concentration of adults with lower levels of educational attainment than in other parts of the city.

3.4.5 Household Income

Unsurprisingly, considering the fact that people with college educations usually have higher-paying jobs, the neighborhood with the highest college-educated share of the population (West of Twin Peaks) is the wealthiest, and the two aforementioned neighborhoods with the lowest share of college-educated people have the lowest median household incomes. It is important to note that, this being household income, you would expect neighborhoods with smaller (non-family) households to seem poorer by this measure. And this is indeed what we see with Mission and Outer Richmond: high rates of non-family households and educational attainment, middling household incomes.

4 Analysis

4.1 Research question 1

  • Let’s put ourselves in context. Our first research question is as follows:

Given the organization of public transportation in the city, can we say that the places with the most crime are those that are the most connected by the transportation network?




As we have seen in the EDA of the data set on crime in San Francisco, most crimes are thefts. We wanted to see if, as in some European cities such as Barcelona or Paris, public transportation was a place which facilitated theft. Since the majority of crimes have very little follow-up, we also presumed that criminal activity is not static. Public transportation might therefore be a vector that could influence the choice of target neighborhoods to rob, while also providing a better means of leaving the scene of the crime quickly.


#> [1] 0.084


We see a very small correlation. It would be interesting to push the analysis a bit further.




If we highlight San Francisco’s transportation network by comparing it to the segmentation of neighborhoods according to the number of crimes, we can create this map.



This is the same blue gradient that we have used earlier on to highlight crimes in a neighborhood. The yellow dots represent the public transportation stops in San Francisco. The network is quite dense and no neighborhood is overlooked. But it seems that the neighborhoods with the most stops are not the ones with the most crime. There are a certain number of stops in Tenderloin, South of Market and Mission (the dark blue neighborhoods) but the areas with the most stops (the 37.8°N/37.78°N and 122.45°W/122.4°W square) are not the ones with the most crime.





We can see a small positive correlation between the number of crimes and the number of transit stops. However, the 3 neighborhoods with the most crimes, which we have highlighted with 3 arrows, are outliers that weigh heavily on the interpretation of the results. Given the variability of the observations, this small correlation is not statistically significant: the black area, which represents the standard error, is very large.



#> 
#>  Pearson's product-moment correlation
#> 
#> data:  crime_demostats_2016$`Violent crimes` and crime_demostats_2016$`number of stops`
#> t = 0.5, df = 39, p-value = 0.6
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.230  0.382
#> sample estimates:
#>   cor 
#> 0.084
#> 
#> Call:
#> lm(formula = `Total crimes` ~ `number of stops`, data = crime_demostats_2016)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -3606  -2169  -1348    -16  14044 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)        3369.53    1181.28    2.85   0.0069 **
#> `number of stops`     2.87      11.24    0.26   0.7995   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4300 on 39 degrees of freedom
#> Multiple R-squared:  0.00167,    Adjusted R-squared:  -0.0239 
#> F-statistic: 0.0654 on 1 and 39 DF,  p-value: 0.8



As expected, this small correlation is not statistically significant. For violent crimes, the correlation of 0.084 has a p-value of 0.6 and 0 lies within the 95% confidence interval [-0.230, 0.382]. For total crimes, the regression coefficient on the number of stops has a p-value of 0.7995. In addition, the share of the variation in the number of crimes explained by the variation in the number of stops between neighbourhoods is extremely small (Multiple R-squared = 0.00167). We therefore find no evidence of a linear relationship between the number of crimes and the number of public transit stops, and no support for our theory that criminals mostly use public transport to leave crime scenes.
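The two tests reported above can be reproduced in outline with `cor.test()` and `lm()`. The sketch below is not the project's code: it uses only the six rows shown in the transport table of section 3.2 rather than all 41 neighborhoods, and it regresses violent crimes (not total crimes) on stops, so its numbers will differ from the output above.

```r
# Six rows copied from the table in section 3.2 (not the full data set).
crime_demostats_2016 <- data.frame(
  `Violent crimes`  = c(1548, 460, 513, 291, 471, 1134),
  `number of stops` = c(263, 256, 236, 168, 149, 134),
  check.names = FALSE
)

# Pearson correlation with its t-test and 95% confidence interval.
ct <- cor.test(crime_demostats_2016$`Violent crimes`,
               crime_demostats_2016$`number of stops`)

# Simple linear regression of crimes on the number of stops.
fit <- lm(`Violent crimes` ~ `number of stops`, data = crime_demostats_2016)

print(ct)
print(summary(fit))

# With a single predictor, R-squared equals the squared correlation.
print(c(r_squared = summary(fit)$r.squared,
        cor_squared = unname(ct$estimate)^2))
```

The last line illustrates why a correlation this small leaves almost nothing for the regression to explain: in a one-predictor model the two quantities are the same number.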




4.2 Research question 2


Here we will explore whether the COVID-19 pandemic had any effect on crime rates.

We will first compare the total number of crimes citywide in the past 5 years, limiting the time period to the period between March 10th and November 13th so that all numbers are comparable to the start of the pandemic in the United States and the last date of our 2020 crime dataset.

year  Total crimes
2015  105429
2016  99693
2017  103105
2018  105208
2019  102845
2020  73034
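Restricting every year to the same March 10th to November 13th window can be sketched as below. This is a base R illustration on a handful of invented dates; comparing "MM-DD" strings is one simple way to apply the same window to all years.

```r
# Toy incident dates, some inside and some outside the comparison window.
dates <- as.Date(c("2019-03-09", "2019-03-10", "2019-11-13",
                   "2019-11-14", "2020-06-01", "2020-01-15"))

# Zero-padded month-day strings sort in calendar order, so the window
# can be checked with plain string comparisons.
month_day <- format(dates, "%m-%d")
in_window <- month_day >= "03-10" & month_day <= "11-13"

# Number of incidents per year inside the window.
counts <- table(year = format(dates[in_window], "%Y"))

print(counts)
```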


There seems to be a clear drop in crime in 2020 over the COVID-19 period relative to previous years. The total for this approximately 8-month period in the previous years has always been close to 100’000 crimes, while in 2020 we have fewer than 75’000, a drop of more than 25%.
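The size of the drop can be checked directly from the table above. The short base R calculation below compares 2020 with the 2015-2019 average; using that average as the baseline is a choice made for this example.

```r
# Totals for the March 10th to November 13th window, from the table above.
totals <- c("2015" = 105429, "2016" = 99693, "2017" = 103105,
            "2018" = 105208, "2019" = 102845, "2020" = 73034)

# Average of the five pre-pandemic years and relative drop in 2020.
baseline <- mean(totals[as.character(2015:2019)])
drop_pct <- (1 - totals[["2020"]] / baseline) * 100

print(baseline)
print(round(drop_pct, 1))
```

Against this baseline of 103256 crimes, the 2020 total is about 29% lower.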

In this time series graph, which compares crimes since the beginning of 2018 with COVID-19 case counts, the crime line drops to a lower level once the infections line appears and stays there. Both series show some short-term cyclical variation (on weekends we see more crimes but fewer recorded coronavirus cases), but over the longer view we can clearly see that crime rates have changed in a more durable way since the start of the pandemic.


This graph shows the timeline of crime rates for 2018, 2019 and 2020, and case counts for 2020, over the course of their respective years. We can see clearly that 2020 followed the same trend in crime rates as the two previous years in January and February: there is no real distinction between the three lines. Once the pandemic comes into play, the 2020 crime line becomes clearly distinct, at a lower rate than the two lines representing previous years.


While it would be important to look into the exact timeline of lockdown measures, it would seem that the decrease in interactions since the beginning of the pandemic, through government measures and individual choices, has led to a decrease in reported crime rates.

We will now take a look at the neighborhood-level data and try to see if the pandemic had any particular effect on the relative crime rates in neighborhoods.

As we can see, the drop in crime does not seem to have been larger in the neighborhoods most affected by COVID-19 than in the others. It would appear that, to the extent it had any effect, the COVID-19 pandemic only had a citywide effect on crime. It is interesting, however, to observe that some of the neighborhoods with the highest rates of COVID-19 infections are also among those that have consistently had the highest crime rates over the past 5 years (Bayview Hunters Point, Mission, and Tenderloin in particular). Perhaps both viral infections and crime tend to happen in the areas where interactions between people are most frequent. Overall, the data here is much less conclusive than the citywide data.


4.3 Research question 3


Does the crime rate vary significantly by neighborhood? Do factors such as education, income, ethnicity of the population, poverty, unemployment rate and population density explain this variation? Do these factors tell us something about the perpetrators of crimes and their victims? We will try to predict the number of crimes in a neighborhood from its socio-economic characteristics and see whether there is any linear relationship.




Variable Selection


To begin with, the data set we are going to use has a lot of variables. Instead of running a stepwise selection on all of them, we first look at which independent variables are most correlated with our dependent variable. Using the first row of the correlation matrix, we selected the 6th, 13th, 27th, 34th and 38th columns of our data set. We will try to explain the total number of crimes with the variables Non-Family Households, Other/Two or More Races, Units of Housing, Median Household Income and Number of people Poverty.
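The screening step described above, reading the row of the correlation matrix that belongs to the dependent variable and keeping the strongest predictors, can be sketched as follows. The data frame `toy` and its column names are invented for illustration and are not the project's data; only the technique matches the text.

```r
set.seed(42)  # toy data only, not the project's data set
n <- 120
toy <- data.frame(
  poverty    = rnorm(n),
  non_family = rnorm(n),
  noise_a    = rnorm(n),
  noise_b    = rnorm(n)
)
# Hypothetical dependent variable driven by two of the four predictors
toy$total_crimes <- 3 * toy$poverty + 2 * toy$non_family + rnorm(n)

# Correlation of the dependent variable with every candidate predictor
predictors <- setdiff(names(toy), "total_crimes")
cors <- cor(toy$total_crimes, toy[predictors])[1, ]

# Rank predictors by absolute correlation and keep the strongest ones
ranked <- sort(abs(cors), decreasing = TRUE)
top_two <- names(ranked)[1:2]
print(ranked)
```

Ranking on the absolute value keeps strongly negative correlations in the running as well as strongly positive ones.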




Model 1


We make our first linear model:

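\[ \begin{align} \text{Total Crimes} = \beta_0 + \beta_1 \, \text{Non-Family Households} + \beta_2 \, \text{Poverty} + \beta_3 \, \text{Other/Two or More Races} + \beta_4 \, \text{Housing} + \beta_5 \, \text{Income} + \varepsilon \end{align} \]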

#> 
#> Call:
#> lm(formula = `Total crimes` ~ `Number of people Poverty` + `Other/Two or More Races` + 
#>     `Units of Housing` + `Non-Family Households` + `Median Household Income`, 
#>     data = crime_demostats_2014_2016)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -5724  -1260   -342    541  11911 
#> 
#> Coefficients:
#>                              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                -701.19464 1096.52287   -0.64    0.524
#> `Number of people Poverty`    1.28446    0.29178    4.40 0.000024
#> `Other/Two or More Races`    -0.02091    0.17254   -0.12    0.904
#> `Units of Housing`           -0.30996    0.15369   -2.02    0.046
#> `Non-Family Households`       0.31744    0.12684    2.50    0.014
#> `Median Household Income`     0.00771    0.00993    0.78    0.439
#>                               
#> (Intercept)                   
#> `Number of people Poverty` ***
#> `Other/Two or More Races`     
#> `Units of Housing`         *  
#> `Non-Family Households`    *  
#> `Median Household Income`     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3060 on 117 degrees of freedom
#> Multiple R-squared:  0.51,   Adjusted R-squared:  0.489 
#> F-statistic: 24.3 on 5 and 117 DF,  p-value: <0.0000000000000002



Not all of our variables are significant. To obtain a better model, we use a backward stepwise selection until all our independent variables are statistically significant. The least significant variable is the one related to the ethnicity of the population, Other/Two or More Races (p-value = 0.904), so we remove it first.
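The backward selection used here, refitting the model and dropping the least significant variable until every remaining p-value is below the chosen threshold, can be written as a short loop. This is a sketch on invented data (`toy`, `x1` to `x4` and `y` are hypothetical names), not the code behind the outputs shown. Note that R's built-in `step()` selects on AIC rather than on p-values, so it would not necessarily follow the same path.

```r
set.seed(7)  # toy data only
n <- 123
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 2 * toy$x1 + 1.5 * toy$x2 + rnorm(n)  # x3 and x4 are pure noise

alpha <- 0.05                       # significance level
kept  <- c("x1", "x2", "x3", "x4")  # start from the full model

repeat {
  fit <- lm(reformulate(kept, response = "y"), data = toy)
  # p-values of the predictors, intercept row excluded
  p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
  worst <- which.max(p_values)
  # Stop when every predictor is significant or only one is left
  if (p_values[worst] <= alpha || length(kept) == 1) break
  kept <- kept[-worst]  # drop the least significant predictor and refit
}

print(kept)
```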



Model 1.2


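\[ \begin{align} \text{Total Crimes} = \beta_0 + \beta_1 \, \text{Non-Family Households} + \beta_2 \, \text{Poverty} + \beta_3 \, \text{Housing} + \beta_4 \, \text{Income} + \varepsilon \end{align} \]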

#> 
#> Call:
#> lm(formula = `Total crimes` ~ `Number of people Poverty` + `Units of Housing` + 
#>     `Non-Family Households` + `Median Household Income`, data = crime_demostats_2014_2016)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -5696  -1311   -345    548  11947 
#> 
#> Coefficients:
#>                              Estimate Std. Error t value  Pr(>|t|)
#> (Intercept)                -671.58725 1064.49340   -0.63     0.529
#> `Number of people Poverty`    1.26582    0.24695    5.13 0.0000012
#> `Units of Housing`           -0.30772    0.15195   -2.03     0.045
#> `Non-Family Households`       0.31502    0.12473    2.53     0.013
#> `Median Household Income`     0.00743    0.00963    0.77     0.442
#>                               
#> (Intercept)                   
#> `Number of people Poverty` ***
#> `Units of Housing`         *  
#> `Non-Family Households`    *  
#> `Median Household Income`     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3040 on 118 degrees of freedom
#> Multiple R-squared:  0.51,   Adjusted R-squared:  0.493 
#> F-statistic: 30.7 on 4 and 118 DF,  p-value: <0.0000000000000002


This time, the least significant variable is the income variable, Median Household Income (p-value = 0.442). We remove this variable.



Model 1.3


\[ \begin{align} \text{Total Crimes} = \beta_0 + \beta_1 \, \text{Non-Family Households} + \beta_2 \, \text{Poverty} + \beta_3 \, \text{Housing} + \varepsilon \end{align} \]

#> 
#> Call:
#> lm(formula = `Total crimes` ~ `Number of people Poverty` + `Units of Housing` + 
#>     `Non-Family Households`, data = crime_demostats_2014_2016)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -5566  -1438   -261    670  11767 
#> 
#> Coefficients:
#>                            Estimate Std. Error t value    Pr(>|t|)
#> (Intercept)                  62.223    478.030    0.13       0.897
#> `Number of people Poverty`    1.140      0.186    6.14 0.000000011
#> `Units of Housing`           -0.280      0.147   -1.90       0.060
#> `Non-Family Households`       0.315      0.125    2.53       0.013
#>                               
#> (Intercept)                   
#> `Number of people Poverty` ***
#> `Units of Housing`         .  
#> `Non-Family Households`    *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3040 on 119 degrees of freedom
#> Multiple R-squared:  0.507,  Adjusted R-squared:  0.495 
#> F-statistic: 40.8 on 3 and 119 DF,  p-value: <0.0000000000000002


We have chosen a significance level of 5%. As the variable Units of Housing has a p-value of 0.060, which is greater than 5%, we remove it as well.



Model 1.4


\[ \begin{align} \log(\text{Total Crimes}) = \beta_0 + \beta_1 \log(\text{Non-Family Households}) + \beta_2 \, \text{Poverty}^2 + \varepsilon \end{align} \]

#> 
#> Call:
#> lm(formula = log(crime_demostats_2014_2016$`Total crimes`) ~ 
#>     log(`Non-Family Households`) + I(`Number of people Poverty`^2), 
#>     data = crime_demostats_2014_2016)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -2.699 -0.409  0.080  0.283  3.199 
#> 
#> Coefficients:
#>                                      Estimate    Std. Error t value
#> (Intercept)                     1.80379491087 0.61425057588    2.94
#> log(`Non-Family Households`)    0.63077046179 0.07190958756    8.77
#> I(`Number of people Poverty`^2) 0.00000001781 0.00000000464    3.84
#>                                          Pr(>|t|)    
#> (Intercept)                                0.0040 ** 
#> log(`Non-Family Households`)    0.000000000000014 ***
#> I(`Number of people Poverty`^2)            0.0002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.942 on 120 degrees of freedom
#> Multiple R-squared:  0.55,   Adjusted R-squared:  0.542 
#> F-statistic: 73.2 on 2 and 120 DF,  p-value: <0.0000000000000002


To reduce the effects of non-normality in the distribution, we applied transformations to the variables: a logarithm on Total crimes and on Non-Family Households, and a square on Number of people Poverty. Both remaining variables are statistically significant, and the Multiple R-squared is 0.55. This seems to be a reasonable model.

#>    log(`Non-Family Households`) I(`Number of people Poverty`^2) 
#>                            1.22                            1.22

We also see that the correlation between the variables is not high at all. For the moment, all the conditions for a linear regression seem to be met.
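To relate the two values of 1.22 to an actual correlation, here is a small sketch. It assumes that these values are variance inflation factors, which the output format suggests but the text does not state. With exactly two predictors, the VIF of each equals 1 / (1 - r^2), where r is the correlation between them, so r can be recovered from the printed number.

```r
vif_value <- 1.22  # value printed above, assumed to be a VIF

# With two predictors: VIF = 1 / (1 - r^2), so r^2 = 1 - 1 / VIF
r_squared <- 1 - 1 / vif_value
abs_r <- sqrt(r_squared)  # absolute correlation between the two predictors

print(round(abs_r, 2))
```

Under that assumption the two transformed predictors are moderately correlated, well below the levels usually taken as a sign of problematic collinearity.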


We still need to check the error distribution to see if our conclusions can be used.

As can be seen on the normal QQ graph, our errors are not normally distributed at the extremes. On the Residuals vs Fitted graph, we see a certain pattern, which suggests that the variance of our errors is not constant. This graph also shows outliers: observations 8, 49 and 90. These three observations represent the same neighborhood of San Francisco at three different time intervals. This neighborhood is a park with very few inhabitants, which makes its demographic data not very relevant, so we have chosen to delete these three observations.
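The outliers above were read off the diagnostic graphs. A numerical counterpart is to flag observations by their standardized residuals and then refit without them, as sketched below. The data are invented, with one artificial outlier planted at row 8, and the cut-off of 3 is a common rule of thumb that we assume here, not a threshold taken from the analysis.

```r
set.seed(3)  # toy data only
n <- 40
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)
toy$y[8] <- toy$y[8] + 15  # plant one artificial outlier at row 8

fit <- lm(y ~ x, data = toy)

# Rows whose standardized residual exceeds the assumed cut-off of 3
flagged <- unname(which(abs(rstandard(fit)) > 3))
print(flagged)

# Refit without the flagged rows; setdiff() stays safe if nothing is flagged
toy_clean <- toy[setdiff(seq_len(n), flagged), ]
fit_clean <- lm(y ~ x, data = toy_clean)
print(summary(fit_clean)$r.squared)
```

Whether a flagged observation should really be dropped remains a judgment call, as with the park neighborhood above.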




#> 
#> Call:
#> lm(formula = log(`Total crimes`) ~ log(`Non-Family Households`) + 
#>     I(`Number of people Poverty`^2), data = crime_demostats_2014_2016_2)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.2110 -0.4245 -0.0208  0.3998  1.8285 
#> 
#> Coefficients:
#>                                       Estimate     Std. Error
#> (Intercept)                     -1.23189331621  0.58766676503
#> log(`Non-Family Households`)     0.97255478817  0.06793020296
#> I(`Number of people Poverty`^2)  0.00000001182  0.00000000369
#>                                 t value            Pr(>|t|)    
#> (Intercept)                       -2.10              0.0382 *  
#> log(`Non-Family Households`)      14.32 <0.0000000000000002 ***
#> I(`Number of people Poverty`^2)    3.21              0.0017 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.736 on 117 degrees of freedom
#> Multiple R-squared:  0.732,  Adjusted R-squared:  0.728 
#> F-statistic:  160 on 2 and 117 DF,  p-value: <0.0000000000000002
#>    log(`Non-Family Households`) I(`Number of people Poverty`^2) 
#>                            1.22                            1.22


Removing the three observations considered as outliers leaves both independent variables statistically significant, but the fit improves noticeably: the Multiple R-squared rises from 0.55 to 0.732, and the coefficient on log(Non-Family Households) rises from about 0.63 to about 0.97.
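As a reading aid, the coefficients printed above for the model without the three park observations can be written out as a fitted equation. This is only a transcription of the output, rounded to three decimals, with Poverty standing for Number of people Poverty:

\[ \begin{align} \log(\widehat{\text{Total Crimes}}) = -1.232 + 0.973 \log(\text{Non-Family Households}) + 1.182 \times 10^{-8} \, \text{Poverty}^2 \end{align} \]

Because both the dependent variable and Non-Family Households are in logarithms, the coefficient 0.973 can be read as an elasticity: a 1% increase in the number of non-family households is associated with roughly a 0.97% increase in total crimes, holding poverty fixed. This describes an association in the fitted model, not a causal effect.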


Let’s take a close look now to see if the assumptions of normality on errors seem better:

The assumption of constant error variance seems to be respected. However, the extreme values still deviate a little from the values we should have. The distribution of errors does not exactly follow a normal law, but it is close enough for our model to be relevant.




What can we conclude from this model? Since this kind of model only describes correlation, we cannot establish a causal link between the independent variables and the dependent variable. Moreover, as in all social studies, there is a lot of noise that makes it difficult to explain the phenomenon we are studying. We will nevertheless propose various hypotheses that could explain our results.


First, the two most predictive variables for crime by neighborhood are poverty and the presence of non-family households. This could be reflective of the nature of the perpetrators, the victims or just the neighborhood itself. These neighborhoods probably have a lot of students and young professionals, indicating above-average rates of social activity and nightlife. Criminals might choose such neighborhoods to commit crimes such as theft because it is easier to pass unnoticed in active spaces. To confirm this hypothesis, it would have been interesting to add a database of entertainment locations in San Francisco, descriptions of the victims and information on the rate of social interactions by neighborhood. It is also possible that this indicates the demographic categories that are more likely to commit crimes. Risky behaviour such as criminal activity might be more common among people who are poorer and do not have families.

Digging deeper, we produced some geospatial representations of different variables that we thought could help us clarify which is more correct: our hypothesis on socially active neighborhoods, or our hypothesis that the demographic data reflect the characteristics of perpetrators. However, this is mostly inconclusive, although we do see the Tenderloin and South of Market neighborhoods appear on the poverty heatmap. To understand what our model based on poverty and non-family households actually means in terms of victims and perpetrators of crime, more detailed data on these people is almost certainly required.

Conclusion


In this project we explored different relationships where we put crime as the dependent variable and demographics, coronavirus infections and transportation networks as independent variables. We had a lot of variables in the socio-economic data, with significant risks of collinearity, which is why we spent some time on variable selection and modeling for this dataset. We performed some transformations on the variables in order to normalize the distribution of errors and in the end found a rather compelling, if not perfect, model for predicting crime based on these variables.

We performed simpler analyses based on descriptive statistics for COVID-19 and public transportation, finding a somewhat stronger relationship in the first case than in the second. In particular, we found a strong citywide effect of the pandemic. However, when looking at particularly affected neighborhoods, we did not find that the pandemic had an effect on the relative geographic distribution of crimes. This all indicates that analysing trends in citywide data is probably easier than finding compelling differences between smaller geographic units, neighborhoods in this case. We believe this is why we found more compelling relationships in our modeling of neighborhood data for the 3rd research question than in our other neighborhood-level analyses. It would probably be interesting for future projects to find more detailed data on the neighborhood of residence of criminals and victims, in order to understand how local crime is, what form of transportation is used by the perpetrators, and what exactly the relationships we find in our demographic model mean. But as it is, we believe we have found some compelling insights into crime rates in San Francisco over the past 20 years on which future projects can be built.